arXiv:1506.06422v2 [stat.ML] 13 Jul 2015


Beyond Hartigan Consistency: 

Merge Distortion Metric for Hierarchical Clustering 


Justin Eldridge Mikhail Belkin 

eldridge@cse.ohio-state.edu mbelkin@cse.ohio-state.edu 

Yusu Wang 

yusu@cse.ohio-state.edu 


Abstract 

Hierarchical clustering is a popular method for analyzing data which associates a tree to 
a dataset. Hartigan consistency has been used extensively as a framework to analyze such 
clustering algorithms from a statistical point of view. Still, as we show in the paper, a tree 
which is Hartigan consistent with a given density can look very different than the correct limit 
tree. Specifically, Hartigan consistency permits two types of undesirable configurations which 
we term over-segmentation and improper nesting. Moreover, Hartigan consistency is a limit 
property and does not directly quantify difference between trees. 

In this paper we identify two limit properties, separation and minimality, which address both 
over-segmentation and improper nesting and together imply (but are not implied by) Hartigan 
consistency. We proceed to introduce a merge distortion metric between hierarchical clusterings 
and show that convergence in our distance implies both separation and minimality. We also 
prove that uniform separation and minimality imply convergence in the merge distortion metric. 
Furthermore, we show that our merge distortion metric is stable under perturbations of the 
density. 

Finally, we demonstrate applicability of these concepts by proving convergence results for 
two clustering algorithms. First, we show convergence (and hence separation and minimality) 
of the recent robust single linkage algorithm of Chaudhuri and Dasgupta (2010). Second, we
provide convergence results on manifolds for topological split tree clustering. 


1 Introduction 


Hierarchical clustering is an important class of techniques and algorithms for representing data in 
terms of a certain tree structure (Jain and Dubes, 1988). When data are sampled from a probability
distribution, one needs to study the relationship between trees obtained from data samples to the 
infinite tree of the underlying probability density. This question was first explored in Hartigan
(1975), which introduced the notion of high-density clusters. Specifically, given a density function
$f : X \to \mathbb{R}$, the high-density clusters are defined to be the connected components of
$\{x \in X : f(x) \geq \lambda\}$ for some $\lambda$. The set of all clusters forms a hierarchical
structure known as the density cluster tree of $f$. The natural notion of consistency for finite
density estimators is to require that any two high-density clusters are also separate in the finite
tree given enough samples. This notion was introduced in Hartigan (1981) and is known as Hartigan
consistency. Still, while clearly desirable, it is well known that Hartigan consistency does not fully
capture the properties of convergence that one would a priori expect. In particular, it does not
exclude trees which are very different from the underlying probability distribution.

In this paper we identify two distinct undesirable configuration types permitted by Hartigan
consistency, over-segmentation (identified as the problem of false clusters in Chaudhuri et al.
(2014)) and improper nesting, and show how both of these result from clusters merging at the wrong
level. To address these issues we propose two basic properties for hierarchical cluster convergence:
minimality and separation. Together they imply Hartigan consistency and, furthermore, rule out “improper”
configurations. We proceed to introduce a merge distortion metric on clustering trees and show that 
convergence in the metric implies both separation and minimality. Moreover, we demonstrate that 
uniform versions of these properties are in fact equivalent to metric convergence. We note that the 
introduction of a quantifiable merge distortion metric also addresses another issue with Hartigan 
consistency, which is a limit property of clustering algorithms and is not quantifiable as such. We 
also prove that the merge distortion metric is robust to small perturbations of the density. 

Still, attempts to formulate some intuitively desirable properties of clustering have led to
well-known impossibility results, such as those proven by Kleinberg (2003). In order to show that our
definitions correspond to actual objects, and, furthermore, to realistic algorithms, we analyze the
robust single linkage clustering proposed by Chaudhuri and Dasgupta (2010). We prove convergence
of that algorithm under our merge distortion metric and hence show that it satisfies separation and
minimality conditions. We also propose a topological split tree algorithm for hierarchical clustering
(based on the algorithm introduced by Chazal et al. (2013) for flat clustering) and demonstrate its
convergence on Riemannian manifolds.


Previous work. The problem of devising an algorithm which provably converges to the true
density cluster tree in the sense of Hartigan has a long history. Hartigan (1981) proved that single
linkage clustering is not consistent in dimensions larger than one. Previous to this, Wishart (1969)
had introduced a more robust version of single linkage, but its consistency had not been known.
Stuetzle and Nugent (2010) introduced another generalization of single linkage designed to estimate
the density cluster tree, but again consistency was not established. Recently, however, two distinct
consistent algorithms have been introduced: the robust single linkage algorithm of Chaudhuri and
Dasgupta (2010), and the tree pruning method of Kpotufe and Luxburg (2011). Both algorithms
are analyzed together, along with a pruning extension, in Chaudhuri et al. (2014). The robust single
linkage algorithm was generalized in Balakrishnan et al. (2013) to densities supported on a
Riemannian submanifold of $\mathbb{R}^d$. We analyze the algorithm of Chaudhuri and Dasgupta (2010)
in a later section. Chaudhuri and Dasgupta (2010) provide several theorems which make precise the
sense in which clusters are connected and separated at each step of the robust single linkage
algorithm. This paper translates their results to our formalism, thereby proving that robust single
linkage converges to the density cluster tree in the merge distortion metric.

A central contribution of this paper will be to introduce notions which extend Hartigan
consistency, and are desirable properties of any algorithm which estimates the density cluster tree.
In a related direction, Kleinberg (2003) outlined three desirable properties of a clustering method,
and proved that no method satisfying all three exists. Ben-David and Ackerman (2009) argued that the
impossibility result of Kleinberg is tied to his formalism, and showed that axioms similar to his
can be made consistent by axiomatizing clustering quality measures as opposed to clustering
functions themselves. Zadeh and Ben-David (2009) and Ackerman et al. (2010) presented axiomatic
characterizations of linkage-based clustering algorithms. Similarly, Carlsson and Memoli (2010)
introduced functoriality as one of three axioms related to Kleinberg’s and showed that single
linkage agglomerative clustering is the only method which simultaneously satisfies each.


2 Preliminaries and definitions 


A clustering C of a set X is the organization of its elements into a collection of subsets of X called 
clusters. In general, clusters may overlap or be disjoint. If the collection of clusters exhibits nesting 
behavior (to be made precise below), the clustering is called hierarchical. The nesting property 
permits us to think of a hierarchical clustering as a tree of clusters, henceforth called a cluster tree. 

Definition 1 (Cluster tree). A cluster tree (hierarchical clustering) of a set $X$ is a collection
$\mathcal{C}$ of subsets of $X$ s.t. $X \in \mathcal{C}$ and $\mathcal{C}$ has hierarchical structure.
That is, if $C, C' \in \mathcal{C}$ such that $C \neq C'$, then $C \cap C' = \emptyset$, or
$C \subset C'$, or $C' \subset C$. Each element $C$ of $\mathcal{C}$ is called a cluster. Each cluster
$C$ is a node in the tree. The descendants of $C$ are those clusters $C' \in \mathcal{C}$ such that
$C' \subset C$. Every cluster in the tree except for $X$ itself is a descendant of $X$, hence $X$ is
called the root of the cluster tree.
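For a finite collection of clusters, the conditions of Definition 1 can be checked directly. The following is a minimal sketch, assuming clusters are given as Python sets; the function name `is_cluster_tree` is hypothetical and not part of the paper.

```python
from itertools import combinations


def is_cluster_tree(ground_set, clusters):
    """Check Definition 1 for a finite collection of clusters."""
    ground = frozenset(ground_set)
    family = {frozenset(c) for c in clusters}
    # The ground set X itself must be a cluster (the root).
    if ground not in family:
        return False
    # Every cluster must be a subset of X.
    if any(not c <= ground for c in family):
        return False
    # Any two distinct clusters are either disjoint or strictly nested.
    for c1, c2 in combinations(family, 2):
        if c1 & c2 and not (c1 < c2 or c2 < c1):
            return False
    return True


nested = [{1, 2, 3, 4}, {1, 2}, {3, 4}, {1}]
overlapping = [{1, 2, 3}, {1, 2}, {2, 3}]
print(is_cluster_tree({1, 2, 3, 4}, nested))
print(is_cluster_tree({1, 2, 3}, overlapping))
```

The second collection fails because {1, 2} and {2, 3} overlap without either containing the other.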


Note that our definition of a cluster tree does not assume that either the set of objects X or the 
collection of clusters C are finite or even countable. Hierarchical clustering is commonly formulated 
as a sequence of nested partitions of X (see Jain and Dubes, 1988), culminating in the partition of
X into singleton clusters. Our formulation differs in that it is a sequence of nested partitions of 
subsets of X. Notably, we don’t impose the requirement that {x} appear as a cluster for every x. 

Given a density $f$ supported on $X \subset \mathbb{R}^d$, a natural way to cluster $X$ is into
regions of high density. Hartigan (1975) made this notion precise by defining a high-density cluster
of $f$ to be a connected component of the superlevel set
$\{f \geq \lambda\} := \{x \in X : f(x) \geq \lambda\}$ for any $\lambda > 0$. It is clear that this
clustering exhibits the nesting property: If $C$ is a connected component of $\{f \geq \lambda\}$,
and $C'$ is a connected component of $\{f \geq \lambda'\}$, then either $C \subset C'$,
$C' \subset C$, or $C \cap C' = \emptyset$. We can therefore interpret the set of all high-density
clusters of a density $f$ as a cluster tree:


Definition 2 (Density cluster tree of $f$). Let $X \subset \mathbb{R}^d$ and consider any
$f : X \to \mathbb{R}$. The density cluster tree of $f$, written $\mathcal{C}_f$, is the cluster tree
whose nodes (clusters) are the connected components of $\{x \in X : f(x) \geq \lambda\}$ for some
$\lambda > 0$.
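As a rough illustration of high-density clusters, consider a density sampled on a one-dimensional grid. Under that assumption (which is ours, not the paper's), the connected components of a superlevel set are the maximal runs of consecutive grid points whose value is at least the level. The name `superlevel_components` and the sample values are hypothetical.

```python
def superlevel_components(values, level):
    """Maximal runs of consecutive grid indices i with values[i] >= level."""
    components, current = [], []
    for i, v in enumerate(values):
        if v >= level:
            current.append(i)
        elif current:
            # The run ended at the previous index; store it and start over.
            components.append(current)
            current = []
    if current:
        components.append(current)
    return components


# Hypothetical two-peaked density sampled on a grid of eight points.
values = [0.1, 0.5, 0.9, 0.4, 0.2, 0.6, 0.8, 0.3]
high = superlevel_components(values, 0.5)
low = superlevel_components(values, 0.2)
print(high, low)
```

Each component found at the higher level is contained in a component found at the lower level, which is the nesting property described above.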


We note that the density cluster tree of $f$ is closely related to the so-called split tree studied in
the computational geometry and topology literature as a variant of the contour tree; see, e.g., Carr
et al. (2003). We discuss a split tree-based approach to estimating the density cluster tree in a
later section.


In practice we do not have access to the true density $f$, but rather a finite collection of samples
$X_n \subset X$ drawn from $f$. We may attempt to recover the structure of the density cluster tree
$\mathcal{C}_f$ by applying a hierarchical clustering algorithm to the sample, producing a discrete
cluster tree $\hat{\mathcal{C}}_{f,n}$ whose clusters are subsets of $X_n$. In order to discuss the
sense in which the discrete estimate $\hat{\mathcal{C}}_{f,n}$ is consistent with the density cluster
tree $\mathcal{C}_f$ in the limit $n \to \infty$, Hartigan (1981) introduced a notion of convergence
which has since been referred to as Hartigan consistency. We follow Chaudhuri and Dasgupta (2010) in
defining Hartigan consistency in terms of the density cluster tree:


Figure 1: The density has a tree-like structure in which $a$ and $a'$ merge at level $\mu$.

Definition 3 (Hartigan consistency). Suppose a sample $X_n \subset X$ of size $n$ is used to construct
a cluster tree $\hat{\mathcal{C}}_{f,n}$ that is an estimate of $\mathcal{C}_f$. For any sets
$A, A' \subset X$, let $A_n$ (respectively $A'_n$) denote the smallest cluster of
$\hat{\mathcal{C}}_{f,n}$ containing $A \cap X_n$ (respectively, $A' \cap X_n$). We say
$\hat{\mathcal{C}}_{f,n}$ is consistent if, whenever $A$ and $A'$ are different connected components
of $\{x \in X : f(x) \geq \lambda\}$ for some $\lambda > 0$,
$\Pr(A_n \text{ is disjoint from } A'_n) \to 1$ as $n \to \infty$.
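The event inside the probability in Definition 3 can be evaluated for one fixed sample and one fixed empirical tree. The sketch below assumes a finite tree given as a list of Python sets; the helper names and the toy trees are hypothetical. It checks a single realization only, not the limiting statement.

```python
def smallest_cluster_containing(clusters, points):
    """Smallest cluster of a finite cluster tree containing every point in `points`."""
    target = frozenset(points)
    # Clusters containing a common nonempty set are nested, so the one with
    # the fewest elements is the smallest with respect to inclusion.
    candidates = [frozenset(c) for c in clusters if target <= frozenset(c)]
    return min(candidates, key=len)


def separates(clusters, a_sample, b_sample):
    """True if the smallest clusters containing the two samples are disjoint."""
    a_n = smallest_cluster_containing(clusters, a_sample)
    b_n = smallest_cluster_containing(clusters, b_sample)
    return a_n.isdisjoint(b_n)


# Hypothetical trees on five sample points.
tree = [{"a1", "a2", "b1", "b2", "x"}, {"a1", "a2"}, {"b1", "b2"}]
trivial = [{"a1", "a2", "b1", "b2", "x"}]
print(separates(tree, {"a1", "a2"}, {"b1", "b2"}))
print(separates(trivial, {"a1", "a2"}, {"b1", "b2"}))
```

The trivial tree, which contains only the root, places both samples in the same smallest cluster and therefore fails the check.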

In what follows, it will be useful to talk about the “height” at which two points in a clustering
merge. To motivate our definition, consider the two points $a$ and $a'$ which sit on the surface of the
density depicted in Figure 1. Intuitively, $a$ sits at height $f(a)$ on the surface, while $a'$ sits at
$f(a')$. If we look at the superlevel set $\{f \geq f(a)\}$, we see that $a$ and $a'$ lie in two
different high-density clusters. As we sweep $\lambda$ downward from $f(a)$, the disjoint components
of $\{f \geq \lambda\}$ containing $a$ and $a'$ grow, until they merge at height $\mu$. We therefore
say that the merge height of $a$ and $a'$ is $\mu$.

We may also interpret the situation depicted in Figure 1 in the language of the density cluster
tree. Let $A$ be the connected component of $\{f \geq f(a)\}$ which contains $a$, and let $A'$ be the
component of $\{f \geq f(a')\}$ containing $a'$. Recognize that $A$ and $A'$ are nodes in the density
cluster tree. As we walk the unique path from $A$ to the root, we eventually come across a node $M$
which contains both $a$ and $a'$. Note that $M$ is a connected component of the superlevel set
$\{f \geq \mu\}$. It is desirable to assign a height to the entire cluster $M$, and a natural choice is
therefore $\mu$.
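In one dimension the merge height has a simple closed form, which may help build intuition. Assuming $X$ is an interval and $f$ is sampled on a grid (both assumptions of this sketch, not of the paper), any connected superlevel component containing two points must contain the whole segment between them, so the merge height is the minimum of $f$ over that segment. The function name `merge_height_1d` is hypothetical.

```python
def merge_height_1d(values, i, j):
    """Merge height of grid points i and j for a density sampled on a 1-D grid."""
    lo, hi = min(i, j), max(i, j)
    # The smallest superlevel component containing both points spans the
    # whole segment, so its level is the lowest value on that segment.
    return min(values[lo:hi + 1])


# Hypothetical two-peaked density: peaks at indices 2 and 6, valley at index 4.
values = [0.1, 0.5, 0.9, 0.4, 0.2, 0.6, 0.8, 0.3]
mu = merge_height_1d(values, 2, 6)
print(mu)
```

Here the two peaks merge at the height of the valley between them, which plays the role of $\mu$ in the discussion above.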

We extend this intuition to cluster trees which may not, in general, be associated with a density
$f$ by introducing the concept of a height function:

Definition 4 (Cluster tree with height function). A cluster tree with a height function is a triple
$\mathsf{C} = (X, \mathcal{C}, h)$, where $X$ is a set of objects, $\mathcal{C}$ is a cluster tree of
$X$, and $h : X \to \mathbb{R}$ is a height function mapping each point in $X$ to a “height”.
Furthermore, we define the height of a cluster $C \in \mathcal{C}$ to be the lowest height of any point
in the cluster. That is, $h(C) = \inf_{x \in C} h(x)$. Note that the nesting property of $\mathcal{C}$
implies that if $C'$ is a descendant of $C$ in the cluster tree, then $h(C') \geq h(C)$.

We will be consistent in using $\mathsf{C}_f$ to denote the density cluster tree of $f$ equipped with
height function $f$. That is, $\mathsf{C}_f = (X, \mathcal{C}_f, f)$. Armed with these definitions, we
may precisely discuss the sense in which points, and by extension clusters, are connected at some
level of a tree:

Definition 5. Let $\mathsf{C} = (X, \mathcal{C}, h)$ be a hierarchical clustering of $X$ equipped with
height function $h$.

1. Let $x, x' \in X$. We say that $x$ and $x'$ are connected at level $\lambda$ if there exists a
$C \in \mathcal{C}$ with $x, x' \in C$ such that $h(C) \geq \lambda$. Otherwise, $x$ and $x'$ are
separated at level $\lambda$.

2. A subset $S \subset X$ is connected at level $\lambda$ if for any $s, s' \in S$, $s$ and $s'$ are
connected at level $\lambda$.

3. Let $S \subset X$ and $S' \subset X$. We say that $S$ and $S'$ are separated at level $\lambda$ if
for any $s \in S$, $s' \in S'$, $s$ and $s'$ are separated at level $\lambda$.

We can now formalize the notion of merge height:

Definition 6 (Merge height). Let $\mathsf{C} = (X, \mathcal{C}, h)$ be a hierarchical clustering
equipped with a height function. Let $x, x' \in X$, and suppose that $M$ is the smallest cluster of
$\mathcal{C}$ containing both $x$ and $x'$. That is, if $M' \in \mathcal{C}$ is a proper sub-cluster of
$M$, then $x \notin M'$ or $x' \notin M'$. We define the merge height of $x$ and $x'$ in $\mathsf{C}$,
written $m_{\mathsf{C}}(x, x')$, to be the height of the cluster $M$ in which the two points merge,
i.e., $m_{\mathsf{C}}(x, x') = h(M)$. If $S \subset X$, we define the merge height of $S$ to be
$\inf_{(s, s') \in S \times S} m_{\mathsf{C}}(s, s')$.
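For a finite cluster tree, the cluster height, the merge height, and connectedness at a level can all be computed by brute force. The sketch below assumes clusters are Python sets and the height function is a dictionary; all names and the toy data are hypothetical.

```python
def cluster_height(cluster, h):
    """Height of a finite cluster: the lowest height of any of its points."""
    return min(h[x] for x in cluster)


def merge_height(clusters, h, x, y):
    """Height of the smallest cluster containing both x and y (Definition 6)."""
    containing = [c for c in clusters if x in c and y in c]
    # Clusters sharing a point are nested, so fewest elements means smallest.
    return cluster_height(min(containing, key=len), h)


def connected_at_level(clusters, h, x, y, level):
    """Definition 5: some cluster contains x and y and has height >= level."""
    return any(x in c and y in c and cluster_height(c, h) >= level
               for c in clusters)


# Hypothetical example: a and a2 sit on two peaks joined through the point m.
h = {"a": 0.9, "a2": 0.8, "m": 0.3, "r": 0.1}
clusters = [{"a", "a2", "m", "r"}, {"a", "a2", "m"}, {"a"}, {"a2"}]
mu = merge_height(clusters, h, "a", "a2")
print(mu, connected_at_level(clusters, h, "a", "a2", mu))
```

Because descendants never have lower height than their ancestors, two points are connected at level $\lambda$ exactly when their merge height is at least $\lambda$.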

In what follows, we argue that a natural and advantageous definition of convergence to the true
density cluster tree is one which requires that, for any two points $x, x'$, the merge height of $x$
and $x'$ in an estimate, $m_{\hat{\mathsf{C}}_{f,n}}(x, x')$, approaches the true merge height
$m_{\mathsf{C}_f}(x, x')$ in the limit $n \to \infty$.

3 Notions of consistency 

In this section we argue that while Hartigan consistency is a desirable property, it is not sufficient to 
guarantee that an estimate captures the true cluster tree in a sense that matches our intuition. We first 
illustrate the issue by giving an example in which an algorithm is Hartigan consistent, yet produces 
results which are very different from the true cluster tree. We then introduce a new, stronger notion 
of consistency which directly addresses the weaknesses of Hartigan’s definition. 

The insufficiency of Hartigan consistency. An algorithm which is Hartigan consistent can
nevertheless produce results which are quite different than the true cluster tree. Figure 2 illustrates
the issue. Figure 2(a) depicts a two-peaked density $f$ from which the finite sample $X_n$ is drawn.
The two disjoint clusters $A$ and $B$ are also shown. The two trees to the right represent possible
outputs of clustering algorithms attempting to recover the hierarchical structure of $f$. Figure 2(b)
depicts what we would intuitively consider to be an ideal clustering of $X_n$, whereas Figure 2(c)
shows an undesirable clustering which does not match our intuition behind the density cluster tree
of $f$.

First, note that while the two clusterings are very different, both satisfy Hartigan consistency.
Hartigan’s notion requires only separation: The smallest empirical cluster containing $A \cap X_n$
must be disjoint from the smallest empirical cluster containing $B \cap X_n$ in the limit. The smallest
empirical cluster containing $A \cap X_n$ in the undesirable clustering is
$A_n := \{x_2, a_1, a_2, a_3\}$, whereas the smallest containing $B \cap X_n$ is
$B_n := \{x_1, b_1, b_2, b_3\}$. $A_n$ and $B_n$ are clearly disjoint, and so Hartigan consistency is
not violated. In fact, the undesirable tree separates any pair of disjoint clusters of $f$, and
therefore represents a possible output of an algorithm which is Hartigan consistent despite being
quite different from the true tree.

We will show that the undesirable configurations of Figure 2(c) arise because Hartigan
consistency does not place strong demands on the level at which a cluster should be connected.
Consider a cluster $A$ occurring at level $\lambda$ of the true density, and let $A_n$ be the smallest
empirical cluster containing all of $A \cap X_n$. In the ideal case, an algorithm would perfectly
recover $A$ such that $A_n = A \cap X_n$. It is much more likely, however, that $A_n$ contains “extra”
points from outside of $A$. Hartigan consistency places one constraint on the nature of these extra
points: They may not belong to some other disjoint cluster of $f$. However, Hartigan’s notion allows
$A_n$ to contain points from clusters which are not disjoint from $A$. By their nature, these points
must be of density less than $\lambda$. If $A_n$ contains such extra points, then $A \cap X_n$ is
separated at level $\lambda$, and in fact only becomes connected at level
$\min_{a \in A_n} f(a) < \lambda$. Therefore, permitting $A \cap X_n$ to become connected at a level
lower than $\lambda$ is equivalent to allowing “extra” points of density $< \lambda$ to be contained
within $A_n$.

Figure 2: (c) is Hartigan consistent, yet looks rather different than the true tree.

The undesirable configurations depicted in Figure 2(c) can be divided into two distinct
categories, which we term over-segmentation and improper nesting. Either of these issues may exist
independently of the other, and both are symptoms of allowing clusters to become connected at
lower levels than what is appropriate.

Over-segmentation occurs when an algorithm fragments a true cluster, returning empirical
clusters which are disjoint at level $\lambda$ but are in actuality part of the same connected
component of $\{f \geq \lambda\}$. The problem is recognized in the literature by Chaudhuri et al.
(2014), who refer to it as the presence of false clusters. Figure 2(c) demonstrates over-segmentation
by including the clusters $A_n := \{x_2, a_1, a_2, a_3\}$ and $B_n := \{x_1, b_1, b_2, b_3\}$. $A_n$ and
$B_n$ are disjoint at level $f(x_1)$, though both are in actuality contained within the same connected
component of $\{f \geq f(x_1)\}$.

It is clear that over-segmentation is a direct result of clusters connecting at the incorrect level.
The severity of the issue is determined by the difference between the levels at which the cluster
connects in the density cluster tree and the estimate. That is, if $A$ is connected at $\lambda$ in the
density cluster tree, but $A \cap X_n$ is only connected at $\lambda - \delta$ in the empirical
clustering, then the larger $\delta$, the greater the extent to which $A$ is over-segmented.

Improper nesting occurs when an empirical cluster $C_n$ is the smallest cluster containing a point
$x$, and $f(x) > \min_{c \in C_n} f(c)$. The clustering in Figure 2(c) displays two instances of
improper nesting. First, the left branch of the cluster tree has improperly nested the cluster
$\{a_1, a_2\}$, as it is the smallest cluster containing $a_2$, yet $f(a_1) < f(a_2)$. The right branch
of the same tree has also been improperly nested in a decidedly “lazier” fashion: the cluster
$\{x_1, b_1, b_2, b_3\}$ is the smallest empirical cluster containing each of $b_1$, $b_2$, and $b_3$,
despite each being of density greater than $f(x_1)$. Improper nesting is considered undesirable
because it breaks the intuition we have about the containment of clusters in the density cluster
tree; namely, if $A \subset A'$ and $a \in A$, $a' \in A'$, then $f(a) > f(a')$.
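The improper nesting condition can be tested point by point on a finite tree. The sketch below is loosely modeled on the clusters named in the text; the exact tree of Figure 2(c) and the density values are not recoverable here, so both are hypothetical, as is the function name.

```python
def improperly_nested_points(clusters, f):
    """Points x whose smallest containing cluster holds a point of lower density."""
    bad = set()
    for x in f:
        containing = [c for c in clusters if x in c]
        # Clusters containing x are nested, so fewest elements means smallest.
        smallest = min(containing, key=len)
        if f[x] > min(f[c] for c in smallest):
            bad.add(x)
    return bad


# Hypothetical density values at the eight sample points.
f = {"x1": 0.1, "x2": 0.2, "a1": 0.5, "a2": 0.6, "a3": 0.7,
     "b1": 0.5, "b2": 0.6, "b3": 0.7}
# Hypothetical tree containing the clusters A_n, {a1, a2} and B_n from the text.
clusters = [set(f), {"x2", "a1", "a2", "a3"}, {"a1", "a2"}, {"a3"},
            {"x1", "b1", "b2", "b3"}]
print(sorted(improperly_nested_points(clusters, f)))
```

In this toy tree the flagged points are $a_2$ (inside $\{a_1, a_2\}$) and $b_1, b_2, b_3$ (inside $B_n$), matching the two instances discussed above.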

Note that like over-segmentation, improper nesting is caused by a cluster becoming connected
at a lower level than is appropriate. For instance, suppose $C_n$ is improperly nested; that is, it is
the smallest empirical cluster containing some point $x$ such that $f(x) > \min_{c \in C_n} f(c)$. Let
$C$ be the connected component of $\{f \geq f(x)\}$ containing $x$, and let $C'_n$ be the smallest
empirical cluster containing all of $C \cap X_n$. Then $C_n \subset C'_n$, such that
$\min_{c \in C'_n} f(c) < f(x)$. In other words, $C \cap X_n$ is connected only below $f(x)$.

As previously mentioned, it is not reasonable to demand that a cluster $A$ be perfectly recovered
by a clustering algorithm. Rather, if $A$ is connected at level $\lambda$ in the density cluster tree,
we should allow $A \cap X_n$ to be first connected at a level $\lambda - \delta$ in the estimate, for
some small positive $\delta$. We make this notion precise with the following definition:

Definition 7 ($\delta$-minimal). Let $A$ be a connected component of
$\{x \in X : f(x) \geq \lambda\}$, and let $\hat{\mathcal{C}}_{f,n}$ be an estimate of the density
cluster tree of $f$ computed from finite sample $X_n$. $A$ is $\delta$-minimal in
$\hat{\mathcal{C}}_{f,n}$ if $A \cap X_n$ is connected at level $\lambda - \delta$ in
$\hat{\mathcal{C}}_{f,n}$.
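Given a finite empirical tree with $f$ as its height function, the smallest $\delta$ for which a cluster is $\delta$-minimal follows from the merge height of its sample: $A \cap X_n$ is connected at level $\lambda - \delta$ exactly when every pair merges at height at least $\lambda - \delta$. The sketch below computes that value; the names, density values, and trees are hypothetical.

```python
from itertools import product


def merge_height(clusters, f, x, y):
    """Height (lowest density) of the smallest cluster containing x and y."""
    containing = [c for c in clusters if x in c and y in c]
    return min(f[p] for p in min(containing, key=len))


def minimal_delta(clusters, f, sample_of_a, level):
    """Smallest delta >= 0 such that the sample is connected at level - delta."""
    m = min(merge_height(clusters, f, s, t)
            for s, t in product(sample_of_a, repeat=2))
    return max(0.0, level - m)


f = {"x1": 0.1, "x2": 0.2, "a1": 0.5, "a2": 0.6, "a3": 0.7,
     "b1": 0.5, "b2": 0.6, "b3": 0.7}
# Hypothetical undesirable tree: a1, a2, a3 only meet inside a cluster with x2.
undesirable = [set(f), {"x2", "a1", "a2", "a3"}, {"a1", "a2"}, {"a3"},
               {"x1", "b1", "b2", "b3"}]
# Hypothetical ideal tree for the same cluster.
ideal = [set(f), {"a1", "a2", "a3"}, {"a2", "a3"}, {"a3"}]
sample_of_a = {"a1", "a2", "a3"}  # sample points of a cluster at level 0.5
print(minimal_delta(undesirable, f, sample_of_a, 0.5))
print(minimal_delta(ideal, f, sample_of_a, 0.5))
```

The undesirable tree connects the sample only at the density of $x_2$, giving a $\delta$ of about 0.3, while the ideal tree gives 0.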

Intuitively, each cluster of the density cluster tree should be $\delta$-minimal in an empirical
clustering for as small a $\delta$ as possible. For example, take any sample $x \in X_n$ and let $C$
be the connected component of $\{f \geq f(x)\}$ containing $x$. Some examination shows that $C$ is
0-minimal in the ideal clustering depicted in Figure 2(b). As the ideal clustering is free from
over-segmentation and improper nesting, it stands to reason that a cluster can only exhibit these
issues to the extent that it is $\delta$-minimal; the larger $\delta$, the more severely a cluster may
be over-segmented or improperly nested.

Minimality and separation. We have identified two senses, over-segmentation and improper
nesting, in which a hierarchical clustering method can produce results which are inconsistent with
the density cluster tree, but which are not prevented by Hartigan consistency. We have shown that
both are symptoms of clusters becoming connected at the improper level, and argued that the extent
to which a cluster is $\delta$-minimal controls the amount by which it is over-segmented or improperly
nested. With more and more samples, we’d like the extent to which a clustering exhibits
over-segmentation and improper nesting to shrink to zero. We therefore introduce a notion of
consistency which requires any cluster to be $\delta$-minimal with $\delta \to 0$ as $n \to \infty$.

In the following, suppose a sample $X_n \subset X$ of size $n$ is used to construct a cluster tree
$\hat{\mathcal{C}}_{f,n}$ that is an estimate of $\mathcal{C}_f$, and let $\hat{\mathsf{C}}_{f,n}$ be
$\hat{\mathcal{C}}_{f,n}$ equipped with $f$ as height function. Furthermore, it is assumed that each
definition holds with probability approaching one as $n \to \infty$.

Definition 8 (Minimality). We say that $\hat{\mathcal{C}}_{f,n}$ ensures minimality if given any
connected component $A$ of the superlevel set $\{x \in X : f(x) \geq \lambda\}$ for some
$\lambda > 0$, $A \cap X_n$ is connected at level $\lambda - \delta$ in $\hat{\mathsf{C}}_{f,n}$ for
any $\delta > 0$ as $n \to \infty$.

Minimality concerns the level at which a cluster is connected; it says nothing about the ability
of an algorithm to distinguish pairs of disjoint clusters. For this, we must complement minimality
with an additional notion of consistency which ensures separation. Hartigan consistency is
sufficient, but does not explicitly address the level at which two clusters are separated. We will
therefore introduce a slightly different notion, which we term separation:

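Definition 9 (Separation). We say that $\hat{\mathcal{C}}_{f,n}$ ensures separation if when $A$ and
$B$ are two disjoint connected components of $\{f \geq \lambda\}$ merging at
$\mu = m_{\mathsf{C}_f}(A \cup B)$, $A \cap X_n$ and $B \cap X_n$ are separated at level
$\mu + \delta$ in $\hat{\mathsf{C}}_{f,n}$ for any $\delta > 0$ as $n \to \infty$.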

It is interesting to note that Hartigan consistency contains some weak notion of connectedness,
as it requires the two sets $A \cap X_n$ and $B \cap X_n$ to be connected into clusters $A_n$ and
$B_n$ at the same level at which they are separated. Our notion only requires that $A \cap X_n$ and
$B \cap X_n$ be disjoint at this level. We “factor out” Hartigan consistency’s idea of connectedness,
leaving separation, and replace it with a stronger notion of minimality.

Taken together, minimality and separation imply Hartigan consistency.

Theorem 1 (Minimality and separation $\Longrightarrow$ Hartigan consistency). If a hierarchical
clustering method ensures both separation and minimality, then it is Hartigan consistent.

Proof. Let $A$ and $A'$ be disjoint connected components of the superlevel set
$\{x \in X : f(x) \geq \lambda\}$ merging at level $\mu$. Pick any $\lambda - \mu > \delta > 0$.
Definitions 8 and 9 imply that there exists an $N$ such that for all $n \geq N$, $A \cap X_n$ and
$A' \cap X_n$ are separated and individually connected at level $\mu + \delta$.

Assume $n \geq N$. Let $A_n$ be the smallest cluster containing all of $A \cap X_n$, and $A'_n$ be the
smallest cluster containing all of $A' \cap X_n$. Suppose for a contradiction that there is some
$x \in X_n$ such that $x \in A_n \cap A'_n$. Then either $A_n \subset A'_n$ or $A'_n \subset A_n$. In
either case, there is some cluster $C$ such that $h(C) \geq \mu + \delta$, $A_n \subset C$, and
$A'_n \subset C$. Since $A \cap X_n \subset A_n$ and $A' \cap X_n \subset A'_n$, this contradicts the
assumption that $A \cap X_n$ and $A' \cap X_n$ are separated at level $\mu + \delta$. Hence
$A_n \cap A'_n = \emptyset$. □

Minimality and separation have been defined as properties which are true for all clusters in the 
limit. In addition, we may define stronger versions of these concepts which require that all clusters 
approach minimality and separation uniformly: 

Definition 10 (Uniform minimality and separation). $\hat{\mathcal{C}}_{f,n}$ ensures uniform
minimality if given any $\delta > 0$ there exists an $N$ depending only on $\delta$ such that for all
$n \geq N$ and all $\lambda$, any cluster $A \in \{x \in X : f(x) \geq \lambda\}$ is connected at level
$\lambda - \delta$. $\hat{\mathcal{C}}_{f,n}$ is said to ensure uniform separation if given any
$\delta > 0$ there exists an $N$ depending only on $\delta$ such that for all $n \geq N$ and all
$\mu$, any two disjoint connected components merging in $\{x \in X : f(x) \geq \mu\}$ are separated
at level $\mu + \delta$.

The uniform versions of minimality and separation are equivalent to the weaker versions under
some assumptions on the density. The proof of the following theorem is given in Appendix A.1.

Theorem 2. If the density $f$ is bounded from above and is such that
$\{x \in X : f(x) \geq \lambda\}$ contains finitely many connected components for any $\lambda$, then
any algorithm which ensures minimality also ensures uniform minimality on $f$, and any algorithm which
ensures separation also ensures uniform separation.

In the next section, we will introduce a distance between hierarchical clusterings, and show that 
convergence in this metric implies these consistency properties. 

4 Merge distortion metric 

The previous section introduced the notions of minimality and separation, which are desirable
properties for a hierarchical clustering algorithm estimating the density cluster tree. Like Hartigan
consistency, minimality and separation are limit properties, and do not directly quantify the
disparity between the true density cluster tree and an estimate. We now introduce a merge distortion
metric on cluster trees (equipped with height functions) which will allow us to do just that.

We make our definitions specifically so that convergence in the merge distortion metric implies
the desirable properties of minimality and separation. Specifically, consider once again the density
depicted in Figure 1. Suppose we run a cluster tree algorithm on a finite sample $X_n$ drawn from
$f$, obtaining a hierarchical clustering $\hat{\mathcal{C}}_{f,n}$. Let $\hat{\mathsf{C}}_{f,n}$ be
this clustering equipped with $f$ as a height function. We may then talk about the height at which two
points merge in $\hat{\mathsf{C}}_{f,n}$, and of the level at which clusters are connected and
separated in $\hat{\mathsf{C}}_{f,n}$. These are the concepts required to discuss minimality and
separation.




Suppose that the algorithm ensures minimality and separation in the limit. What can we say
about the merge height of $a$ and $a'$ in $\hat{\mathsf{C}}_{f,n}$ as $n \to \infty$? First,
minimality will suggest that $M \cap X_n$ be connected in $\hat{\mathsf{C}}_{f,n}$ at level
$\mu - \delta$, with $\delta \to 0$, where $M$ is as it appears in Figure 1. This implies that the
merge height of $a$ and $a'$ is bounded below by $\mu - \delta$, with $\delta \to 0$. On the other
hand, separation implies that $A \cap X_n$ and $A' \cap X_n$ be separated at level $\mu + \delta$,
with $\delta \to 0$. Therefore the merge height of $a$ and $a'$ is bounded above by $\mu + \delta$,
with $\delta \to 0$. Hence in the limit $n \to \infty$, the merge height of $a$ and $a'$ in
$\hat{\mathsf{C}}_{f,n}$, written $m_{\hat{\mathsf{C}}_{f,n}}(a, a')$, must converge to $\mu$, which is
otherwise known as $m_{\mathsf{C}_f}(a, a')$: the merge height of $a$ and $a'$ in the true density
cluster tree.

With this as motivation, we’ll work backwards, defining our distance between clusterings in
such a way that convergence in the metric implies that the merge height between any two points in
the estimated tree converges to the merge height in the true density cluster tree. We’ll then show
that this entails minimality and separation, as desired.


Merge distortion metric. Let $\mathsf{C}_1 = (X_1, \mathcal{C}_1, h_1)$ and
$\mathsf{C}_2 = (X_2, \mathcal{C}_2, h_2)$ be two cluster trees equipped with height functions. Recall
from Definition 6 that each cluster tree is associated with its own merge height function which
summarizes the level at which pairs of points merge. We define the distance between $\mathsf{C}_1$
and $\mathsf{C}_2$ in terms of the distortion between merge heights. In general, $\mathsf{C}_1$ and
$\mathsf{C}_2$ cluster different sets of objects, so we will use the distortion with respect to a
correspondence between these sets.


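Definition 11 (Merge distortion metric). Let $\mathsf{C}_1 = (X_1, \mathcal{C}_1, h_1)$ and
$\mathsf{C}_2 = (X_2, \mathcal{C}_2, h_2)$ be two hierarchical clusterings equipped with height
functions. Let $S_1 \subset X_1$ and $S_2 \subset X_2$. Let $\gamma \subset S_1 \times S_2$ be a
correspondence* between $S_1$ and $S_2$. The merge distortion distance between $\mathsf{C}_1$ and
$\mathsf{C}_2$ with respect to $\gamma$ is defined as

$$d_\gamma(\mathsf{C}_1, \mathsf{C}_2) = \max_{(x_1, x_2), (x'_1, x'_2) \in \gamma} \left| m_{\mathsf{C}_1}(x_1, x'_1) - m_{\mathsf{C}_2}(x_2, x'_2) \right|.$$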
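For finite trees and a finite correspondence, the merge distortion distance can be evaluated by enumerating all pairs of corresponding pairs. The sketch below is a brute-force illustration only; representing a tree as a `(clusters, heights)` tuple, the function names, and the toy data are assumptions made here.

```python
from itertools import product


def merge_height(clusters, h, x, y):
    """Height (lowest h-value) of the smallest cluster containing x and y."""
    containing = [c for c in clusters if x in c and y in c]
    return min(h[p] for p in min(containing, key=len))


def merge_distortion(tree1, tree2, gamma):
    """Largest gap between merge heights over all pairs of corresponding pairs."""
    clusters1, h1 = tree1
    clusters2, h2 = tree2
    return max(
        abs(merge_height(clusters1, h1, x1, y1)
            - merge_height(clusters2, h2, x2, y2))
        for (x1, x2), (y1, y2) in product(gamma, repeat=2)
    )


# Hypothetical example: two trees on the same points with the same heights.
f = {"a": 0.9, "b": 0.8, "c": 0.2}
fine = ([{"a", "b", "c"}, {"a", "b"}, {"a"}], f)
coarse = ([{"a", "b", "c"}], f)  # only the root cluster
identity = [(x, x) for x in f]   # identity correspondence
print(merge_distortion(fine, coarse, identity))
print(merge_distortion(fine, fine, identity))
```

In this toy case the distance between the two trees is about 0.7, driven by the point `a`, whose smallest cluster has height 0.9 in the finer tree but 0.2 in the root-only tree; the distance from a tree to itself is 0.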


The above definition is related to the standard notion of the distortion of a correspondence
between two metric spaces (Burago et al., 2001). We note that if $X_1 = X_2$ and $\gamma$ is a
correspondence between $X_1$ and $X_2$, then $d_\gamma(\mathsf{C}_1, \mathsf{C}_2) = 0$ implies that
$\mathsf{C}_1 = \mathsf{C}_2$ in the sense that the two trees $\mathcal{C}_1$ and $\mathcal{C}_2$ are
isomorphic and the height functions for corresponding nodes are identical.

Now consider the special case of the distance between the true density cluster tree
$\mathsf{C}_f = (X, \mathcal{C}_f, f)$ and a finite estimate. Suppose we run a hierarchical clustering
algorithm on a sample $X_n \subset X$ of size $n$ drawn from $f$, obtaining a cluster tree
$\hat{\mathcal{C}}_{f,n}$. Denote by $\hat{\mathsf{C}}_{f,n} = (X_n, \hat{\mathcal{C}}_{f,n}, f)$ the
cluster tree equipped with height function $f$. Then the natural correspondence is induced by
identity in $X_n$. That is, $\hat{\gamma}_n = \{(x, x) : x \in X_n\}$. We then define our notion of
convergence to the density cluster tree with respect to this correspondence:


Definition 12 (Convergence to the density cluster tree). We say that a sequence of cluster trees
$\{\hat{\mathsf{C}}_{f,n}\}$ converges to the high-density cluster tree $\mathsf{C}_f$ of $f$, written
$\hat{\mathsf{C}}_{f,n} \to \mathsf{C}_f$, if for any $\epsilon > 0$ there exists an $N$ such that for
all $n \geq N$, $d_{\hat{\gamma}_n}(\hat{\mathsf{C}}_{f,n}, \mathsf{C}_f) < \epsilon$.

* Recall that a correspondence $\gamma$ between sets $S$ and $S'$ is a subset of $S \times S'$ such
that $\forall s \in S$, $\exists s' \in S'$ such that $(s, s') \in \gamma$, and $\forall s' \in S'$,
$\exists s \in S$ such that $(s, s') \in \gamma$.

5 Properties of the merge distortion metric


We now prove various useful properties of our merge distortion metric. First, we show that
convergence in the distance implies both uniform minimality and uniform separation. We then show that
the converse is also true. We conclude by discussing stability properties of the distance.

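Theorem 3. $\hat{\mathsf{C}}_{f,n} \to \mathsf{C}_f$ implies 1) uniform minimality and 2) uniform
separation.

Proof. Our proof consists of two parts.

Part I: $\hat{\mathsf{C}}_{f,n} \to \mathsf{C}_f$ implies uniform minimality. Pick any $\delta > 0$
and let $n$ be large enough that $d(\mathsf{C}_f, \hat{\mathsf{C}}_{f,n}) < \delta$. Let $A$ be a
connected component of $\{x \in X : f(x) \geq \lambda\}$ for arbitrary $\lambda$. Let
$a, a' \in A \cap X_n$. Then $m_{\hat{\mathsf{C}}_{f,n}}(a, a') > m_{\mathsf{C}_f}(a, a') - \delta$.
But $a$ and $a'$ are elements of $A$, such that $m_{\mathsf{C}_f}(a, a') \geq \lambda$. Hence
$m_{\hat{\mathsf{C}}_{f,n}}(a, a') > \lambda - \delta$. Since $a$ and $a'$ were arbitrary, it follows
that $A \cap X_n$ is connected at level $\lambda - \delta$.

Part II: $\hat{\mathsf{C}}_{f,n} \to \mathsf{C}_f$ implies uniform separation. Pick any $\delta > 0$
and let $n$ be large enough that $d(\mathsf{C}_f, \hat{\mathsf{C}}_{f,n}) < \delta$. Let $A$ and $A'$
be disjoint connected components of $\{x \in X : f(x) \geq \lambda\}$ for arbitrary $\lambda$. Let
$\mu := m_{\mathsf{C}_f}(A \cup A')$ be the merge height of $A$ and $A'$ in the density cluster tree.
Take any $a \in A \cap X_n$ and $a' \in A' \cap X_n$. Then
$m_{\hat{\mathsf{C}}_{f,n}}(a, a') < m_{\mathsf{C}_f}(a, a') + \delta = \mu + \delta$. Therefore $a$
and $a'$ are separated at level $\mu + \delta$. Since $a$ and $a'$ were arbitrary, it follows that
$A \cap X_n$ and $A' \cap X_n$ are separated at level $\mu + \delta$. □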

The converse is also true. In other words, convergence in our metric is equivalent to the combination 
of uniform minimality and uniform separation. 

Theorem 4. If $\{\hat{\mathsf{C}}_{f,n}\}$ ensures uniform separation and uniform minimality, then $\hat{\mathsf{C}}_{f,n} \to \mathsf{C}_f$. 

Proof. Take any $\delta > 0$. Uniform separation and minimality imply that there exists an $N$ such that 
for all $\lambda$ any cluster $A$ of $\{x \in \mathcal{X} : f(x) \ge \lambda\}$ is connected at level $\lambda - \delta$, and for all $\mu$ any two 
disjoint clusters $B, B'$ merging at $\mu$ are separated at level $\mu + \delta$. Assume $n \ge N$, and consider any 
$x, x' \in X_n$. W.L.O.G., assume $f(x') \ge f(x)$. We will show that $|m_{\hat{\mathsf{C}}_{f,n}}(x, x') - m_{\mathsf{C}_f}(x, x')| \le \delta$. 

Let $A$ be the connected component of $\{f \ge f(x)\}$ containing $x$, and let $A'$ be the connected 
component of $\{f \ge f(x')\}$ containing $x'$. There are two cases: either $A' \subseteq A$, or $A \cap A' = \emptyset$. 

Case I: $A' \subseteq A$. Then $m_{\mathsf{C}_f}(x, x') = f(x)$. Minimality implies that $A \cap X_n$ is connected at level 
$f(x) - \delta$, and therefore $m_{\hat{\mathsf{C}}_{f,n}}(x, x') \ge f(x) - \delta$. On the other hand, clearly $m_{\hat{\mathsf{C}}_{f,n}}(x, x') \le f(x)$. 
Hence $|m_{\hat{\mathsf{C}}_{f,n}}(x, x') - m_{\mathsf{C}_f}(x, x')| \le \delta$. 

Case II: $A \cap A' = \emptyset$. Let $\mu := m_{\mathsf{C}_f}(x, x')$ be the merge height of $x$ and $x'$ in the density cluster 
tree of $f$, and suppose that $M$ is the connected component of $\{f \ge \mu\}$ containing $x$ and $x'$. Then 
separation implies that $x$ and $x'$ are separated at level $\mu + \delta$, such that $m_{\hat{\mathsf{C}}_{f,n}}(x, x') < \mu + \delta$. On the 
other hand, minimality implies that $M \cap X_n$ is connected at level $\mu - \delta$, so that $m_{\hat{\mathsf{C}}_{f,n}}(x, x') \ge \mu - \delta$. 
Therefore $|m_{\hat{\mathsf{C}}_{f,n}}(x, x') - m_{\mathsf{C}_f}(x, x')| \le \delta$. □ 

Stability. An important property to study for a distance measure is its stability; namely, to quantify 
how much the cluster tree varies as the input is perturbed. We provide two such results. 

The first result says that the density cluster tree induced by a density function is stable under our 
merge distortion metric with respect to $L_\infty$-perturbation of the density function. The second result 
states that given a fixed hierarchical clustering, the cluster tree is stable w.r.t. small changes of the 
height function it is equipped with. The proofs of these results are in Appendix A. 

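Theorem 5 ($L_\infty$-stability of true cluster tree). Given a density function $f : \mathcal{X} \to \mathbb{R}$ supported on 
$\mathcal{X} \subseteq \mathbb{R}^d$, and a perturbation $\tilde{f} : \mathcal{X} \to \mathbb{R}$ of $f$, let $\mathcal{C}_f$ and $\mathcal{C}_{\tilde{f}}$ be the resulting density cluster trees 
as defined above, and let $\mathsf{C}_f := (\mathcal{X}, \mathcal{C}_f, f)$ and $\mathsf{C}_{\tilde{f}} := (\mathcal{X}, \mathcal{C}_{\tilde{f}}, \tilde{f})$ denote the cluster trees 
equipped with height functions. We have $d_\gamma(\mathsf{C}_f, \mathsf{C}_{\tilde{f}}) \le \|f - \tilde{f}\|_\infty$, where $\gamma \subseteq \mathcal{X} \times \mathcal{X}$ is the natural 
correspondence induced by identity, $\gamma = \{(x, x) \mid x \in \mathcal{X}\}$. 

Theorem 6 ($L_\infty$-stability w.r.t. $f$). Given a cluster tree $(X, \mathcal{C})$, let $\mathsf{C}_1 = (X, \mathcal{C}, f_1)$ and $\mathsf{C}_2 = 
(X, \mathcal{C}, f_2)$ be the hierarchical clusterings equipped with two height functions $f_1$ and $f_2$, respectively. 
Let $\gamma \subseteq X \times X$ be the natural correspondence induced by identity on $X$; that is, $\gamma = \{(x, x) \mid x \in 
X\}$. We then have $d_\gamma(\mathsf{C}_1, \mathsf{C}_2) \le 2\|f_1 - f_2\|_\infty$. 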

Theorem 6 in particular leads to the following: Given a density $f : \mathcal{X} \to \mathbb{R}$ supported on $\mathcal{X} \subseteq \mathbb{R}^d$, 
suppose we have a hierarchical clustering $\hat{\mathcal{C}}_n$ constructed from a sample $X_n \subset \mathcal{X}$. However, we 
do not know the true density function $f$. Instead, suppose we have a density estimator producing 
an empirical density function $f_n : X_n \to \mathbb{R}$. Set $\hat{\mathsf{C}}_{f,n} = (X_n, \hat{\mathcal{C}}_n, f)$ as before, and $\hat{\mathsf{C}}_{f_n,n} = 
(X_n, \hat{\mathcal{C}}_n, f_n)$. Theorem 6 implies that $d(\hat{\mathsf{C}}_{f,n}, \hat{\mathsf{C}}_{f_n,n}) \le \|f - f_n\|_\infty$. By the triangle inequality, this 
further bounds 

$$d(\mathsf{C}_f, \hat{\mathsf{C}}_{f_n,n}) \le d(\mathsf{C}_f, \hat{\mathsf{C}}_{f,n}) + \|f - f_n\|_\infty. \qquad (1)$$

Assuming that the density estimator is consistent, we note that the cluster tree $\hat{\mathsf{C}}_{f_n,n}$ also converges 
to $\mathsf{C}_f$ if $\hat{\mathsf{C}}_{f,n}$ converges to $\mathsf{C}_f$. This has an important implication from a practical point of view. 
Imagine that we are given a sequence of more and more samples $X_{n_1}, X_{n_2}, \ldots$, and we construct 
a sequence of hierarchical clusterings $\hat{\mathcal{C}}_{n_1}, \hat{\mathcal{C}}_{n_2}, \ldots$. In practice, in order to test whether the current 
hierarchical clustering converges or not, one may wish to compare two consecutive clusterings $\hat{\mathcal{C}}_{n_i}$ 
and $\hat{\mathcal{C}}_{n_{i+1}}$ and measure their distance. However, since the true density is not available, one cannot 
compute the cluster tree distance $d_{\gamma_i}(\hat{\mathsf{C}}_{f,n_i}, \hat{\mathsf{C}}_{f,n_{i+1}})$, where the correspondence is induced by the 
natural inclusion $X_{n_i} \subseteq X_{n_{i+1}}$; that is, $\gamma_i = \{(x, x) \mid x \in X_{n_i}\}$. Eqn. (1) justifies the use of 
a consistent empirical density estimator and computing $d_{\gamma_i}(\hat{\mathsf{C}}_{f_{n_i},n_i}, \hat{\mathsf{C}}_{f_{n_{i+1}},n_{i+1}})$ instead. 


6 Convergence of robust single linkage 


We now analyze the robust single linkage algorithm of Chaudhuri and Dasgupta (2010) in the context 
of our formalism. Chaudhuri and Dasgupta (2010) and Chaudhuri et al. (2014) previously 
studied the sense in which robust single linkage ensures that clusters are separated and connected at 
the appropriate levels of the empirical tree. Our analysis translates their results to our definitions of 
minimality and separation, thereby reinterpreting the convergence of robust single linkage in terms 
of our merge distortion metric. 

A simple description of the algorithm is given in Appendix B. Essentially, the method produces 
a sequence of graphs $G_r$ as $r$ ranges from $0$ to $\infty$. The sequence has a nesting property: if $r \le r'$, 
then $V_r \subseteq V_{r'}$ and $E_r \subseteq E_{r'}$. We interpret this sequence of graphs as a cluster tree by taking each 
connected component in any graph $G_r$ as a cluster. We equip this cluster tree with the true density 
$f$ as a height function, and refer to it as $\hat{\mathsf{C}}_{f,n}$ in conformity with the preceding sections of this paper. 

In what follows, assume that $f$ is: 1) $c$-Lipschitz; 2) compactly supported (and hence bounded 
from above); and 3) such that $\{f \ge \lambda\}$ has finitely many connected components for any $\lambda$. We will 
prove that the algorithm ensures minimality and separation. This, together with the assumptions on 
$f$ and the preceding theorems, will imply convergence in the merge distortion distance. 

Suppose we run the robust single linkage algorithm on a sample of size $n$. Denote by $v_d$ the 
volume of the $d$-dimensional unit hypersphere, and let $B(x, r)$ be the closed ball of radius $r$ around $x$ 
in $\mathbb{R}^d$. We will write $f(B(x, r))$ to denote the probability of $B(x, r)$ under $f$. Define $r(\lambda)$ to be the 
value of $r$ such that $v_d r^d \lambda = \frac{k}{n} + \frac{C_\delta}{n}\sqrt{k d \log n}$. Here, $k$ is a parameter of the algorithm which we 
will constrain, and $C_\delta$ is the constant appearing in Lemma IV.1 of Chaudhuri et al. (2014). First, 
we must show that in the limit, $G_{r(\lambda)}$ contains no points of density less than $\lambda - \epsilon$, for arbitrary $\epsilon$. 
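
As a numerical illustration of the radius $r(\lambda)$, the sketch below solves $v_d r^d \lambda = \frac{k}{n} + \frac{C_\delta}{n}\sqrt{k d \log n}$ for $r$. The function names and the sample parameter values are hypothetical, and `c_delta=1.0` is only a placeholder, not the actual constant from Lemma IV.1 of Chaudhuri et al. (2014). The volume formula $\pi^{d/2} / \Gamma(d/2 + 1)$ is the standard one for the Euclidean unit ball.

```python
import math


def unit_ball_volume(d):
    """Volume of the d-dimensional Euclidean unit ball."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


def r_of_lambda(lam, n, k, d, c_delta):
    """Solve v_d * r**d * lam = k/n + (c_delta/n) * sqrt(k * d * log(n)) for r."""
    rhs = k / n + (c_delta / n) * math.sqrt(k * d * math.log(n))
    return (rhs / (unit_ball_volume(d) * lam)) ** (1.0 / d)


# Illustrative values only; c_delta is a placeholder constant.
r_low = r_of_lambda(lam=1.0, n=1000, k=50, d=2, c_delta=1.0)
r_high = r_of_lambda(lam=2.0, n=1000, k=50, d=2, c_delta=1.0)
```

With all other parameters fixed, the radius shrinks as the level grows, so a lower level corresponds to a larger radius and hence to a larger graph in the nested sequence.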

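Lemma 1. Fix $\epsilon > 0$ and $\lambda > 0$. Then if $\alpha \ge \sqrt{2}$ and $k \ge (8 C_\delta \lambda / \epsilon)^2 d \log n$, there exists an $N$ 
such that for all $n \ge N$, if $x \in G_{r(\lambda)}$, then $f(x) > \lambda - \epsilon$. 

Proof. Define $\tilde{r} = r(\lambda - \epsilon/2)$. There exists an $N$ such that for any $n \ge N$, $\tilde{r} c \le \epsilon/4$. Consider 
any point $x \in G_{\tilde{r}}$. By virtue of $x$'s membership in the graph, $X_n$ contains $k$ points within $B(x, \tilde{r})$. 
Lemma IV.1 in (Chaudhuri et al., 2014) implies that $f(B(x, \tilde{r})) > \frac{k}{n} - \frac{C_\delta}{n}\sqrt{k d \log n}$. From our 
smoothness assumption, we have $v_d \tilde{r}^d (f(x) + \tilde{r} c) \ge f(B(x, \tilde{r})) > \frac{k}{n} - \frac{C_\delta}{n}\sqrt{k d \log n}$. Multiplying 
both sides by $\lambda - \epsilon/2$ and substituting the definition of $\tilde{r}$ gives 

$$v_d \tilde{r}^d (\lambda - \epsilon/2)(f(x) + \tilde{r} c) = \left(\frac{k}{n} + \frac{C_\delta}{n}\sqrt{k d \log n}\right)(f(x) + \tilde{r} c) > (\lambda - \epsilon/2)\left(\frac{k}{n} - \frac{C_\delta}{n}\sqrt{k d \log n}\right),$$

so that 

$$f(x) > (\lambda - \epsilon/2)\left\{\frac{\frac{k}{n} - \frac{C_\delta}{n}\sqrt{k d \log n}}{\frac{k}{n} + \frac{C_\delta}{n}\sqrt{k d \log n}}\right\} - \tilde{r} c \ge \left(1 - 2 C_\delta \sqrt{\frac{d \log n}{k}}\right)(\lambda - \epsilon/2) - \epsilon/4$$

$$\ge (\lambda - \epsilon/2) - \epsilon/4 - \epsilon/4 = \lambda - \epsilon.$$

The last inequality uses the lower bound on $k$, which gives $2 C_\delta \sqrt{d \log n / k} \le \epsilon / (4\lambda)$. 
Hence for any point $x \in G_{\tilde{r}}$, $f(x) > \lambda - \epsilon$. Note that $\tilde{r} \ge r(\lambda)$, implying that any point in $G_{r(\lambda)}$ is 
also in $G_{\tilde{r}}$. Therefore if $x \in G_{r(\lambda)}$, $f(x) > \lambda - \epsilon$. □ 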

We now make our claim. We will use the following fact without proof: For any connected component $A$ of $\{f \ge \lambda\}$ 
and $\delta > 0$, there exists an $N$ such that for all $n \ge N$, if $A \cap X_n \ne \emptyset$, there is at least one point 
$x \in A \cap X_n$ with $f(x) < \lambda + \delta$. This follows immediately from the continuity of $f$ and the 
inequalities in Lemma IV.1 of Chaudhuri et al. (2014). 


Theorem 7. Robust single linkage converges in probability to the density cluster tree $\mathsf{C}_f$ in the 
merge distortion distance. 

Proof. It is sufficient to prove minimality and separation, as the preceding theorems then imply convergence. 
Fix any $\epsilon > 0$, and let $A$ be a connected component of $\{f \ge \lambda\}$. Define $\sigma = \epsilon/(2c)$, and let 
$A_\sigma$ be the set $A$ thickened by closed balls of radius $\sigma$. Define $\lambda' := \inf_{x \in A_\sigma} f(x) \ge \lambda - \epsilon/2$. 
Theorem IV.7 in (Chaudhuri et al., 2014) implies that there exists an $N_1$ such that for all $n \ge N_1$, $A \cap X_n$ 
is connected in $G_{r(\lambda')}$. Take $\tilde{\epsilon} = \epsilon/2$ in our Lemma 1; there exists an $N_2$ above which each point $x$ 
in $G_{r(\lambda')}$ has density $f(x) > \lambda' - \tilde{\epsilon} \ge (\lambda - \epsilon/2) - \epsilon/2 = \lambda - \epsilon$. Then for all $n \ge \max\{N_1, N_2\}$, 
$A \cap X_n$ is connected in $G_{r(\lambda')}$ at level no less than $\lambda - \epsilon$. This proves minimality. 

Again fix $\epsilon > 0$ and let $A$ and $A'$ be connected components of $\{f \ge \lambda\}$ merging at some height 
$\mu = m_{\mathsf{C}_f}(A \cup A')$. Let $\tilde{A}$ and $\tilde{A}'$ be the connected components of $\{f \ge \mu + \epsilon/2\}$ containing $A$ 
and $A'$, respectively. Define $\sigma = \epsilon/(4c)$, and let $\tilde{A}_\sigma$ (resp. $\tilde{A}'_\sigma$) be the set $\tilde{A}$ (resp. $\tilde{A}'$) thickened 
by closed balls of radius $\sigma$. Define $\mu'$ to be the infimum of $f$ over the thickened sets, so that $\mu' \ge \mu + \epsilon/4$. 
Then Lemma IV.3 in (Chaudhuri et al., 2014) implies that there exists some $N_1$ such that for all $n \ge N_1$, $\tilde{A} \cap X_n$ and 
$\tilde{A}' \cap X_n$ are disconnected in $G_{r(\mu')}$ and individually connected. Let $N_2$ be large enough that there 
exists a point $x_1 \in \tilde{A} \cap X_n$ with $f(x_1) < \mu + \epsilon$. Then for all $n \ge \max\{N_1, N_2\}$, $A \cap X_n$ and 
$A' \cap X_n$ are separated at level $\mu + \epsilon$. This proves separation. □ 


7 Split-tree based hierarchical clustering 


We also consider a different approach to estimate the cluster tree, using ideas from the field of 
computational topology. The method is based on the clustering algorithm proposed by Chazal et al. 
(2013), although that work focuses on analyzing flat clustering. We briefly describe our method and 
state our main result. A detailed description and the proof are relegated to Appendix C. 

Our algorithm takes as input a set of points $P_n$ sampled i.i.d. from a density $f$ supported on 
an unknown Riemannian manifold, an empirically estimated density function $f_n$, and a parameter 
$r > 0$, and outputs a hierarchical clustering tree on $P_n$. Let $K_n$ be the proximity graph on $P_n$, in 
which every point in $P_n$ is connected to every other point that is within distance $r$. We then track the 
connected components of the subgraph of $K_n$ spanned by all points $P_n^\lambda = \{p \in P_n : f_n(p) \ge \lambda\}$ 
as we sweep $\lambda$ from high to low. The set of clusters (connected components in the subgraphs) 
produced this way and their natural nesting relations give rise to a hierarchical clustering that we 
refer to as the split-cluster tree $T_n^r$. 

Comparing this with the definition of the high-density cluster tree, we note that 
intuitively, the split-cluster tree is a discrete approximation of the high-density cluster tree $\mathcal{C}_f$ 
for the true density function $f : \mathcal{M} \to \mathbb{R}$ where (i) the density $f$ is approximated by the empirical 
density $f_n$; and (ii) the connectivity of the domain $\mathcal{M}$ is approximated by the proximity graph $K_n$. 

It turns out that the constructed tree is related to the so-called split tree studied in the 
computational geometry and topology literature as a variant of the contour tree; see, e.g., (Carr et al., 
2003; Wang et al., 2014). Due to this relation, the split-cluster tree can be constructed efficiently in 
$O(n\alpha(n))$ time using a union-find data structure once nodes in $P_n$ are sorted (Carr et al., 2003). 

Our main result is that under mild conditions, the split-cluster tree converges to the true high-density 
cluster tree $\mathsf{C}_f$ of $f : \mathcal{M} \to \mathbb{R}$ in merge distortion distance. See Appendix C for full 
details. 


Theorem 8. Let $\mathcal{M}$ be a compact $m$-dimensional Riemannian manifold embedded in $\mathbb{R}^d$ with 
bounded absolute curvature and positive strong convexity radius. Let $f : \mathcal{M} \to \mathbb{R}$ be a $c$-Lipschitz 
probability density function supported on $\mathcal{M}$. Let $P_n$ be a set of $n$ points sampled i.i.d. according 
to $f$. Assume that we are given a density estimator such that $\|f - f_n\|_\infty$ converges to $0$ as $n \to \infty$. 
For any fixed $\epsilon > 0$, we have, with probability 1 as $n \to \infty$, that $d(\mathsf{C}_f, \hat{\mathsf{C}}_{f,n}) \le (4c + 1)\epsilon$, where 
the parameter $r$ in computing the split-cluster tree $T_n^r$ is set to be $2\epsilon$, and $\hat{\mathsf{C}}_{f,n} = (P_n, T_n^r, f)$ is the 
hierarchical clustering tree equipped with the height function $f$. 


Acknowledgements. The authors thank anonymous reviewers for insightful comments. This 
work is in part supported by the National Science Foundation (NSF) under grants CCF-1319406, 
RI-1117707, and CCF-1422830. 

^ More precisely, Lemma IV.3 requires $A$ and $A'$ to be so-called $(\sigma, \epsilon)$-separated, for some $\sigma$ and $\epsilon$. It follows from 
the Lipschitz-continuity of $f$ that there is some $\epsilon$ so that $A$ and $A'$ are $(\sigma, \epsilon)$-separated for this choice of $\sigma$. 

A Proofs 

A.1 Proof of Theorem 2 

Theorem 9. Let $f$ be a density supported on $\mathcal{X}$, and let $\{\hat{\mathsf{C}}_{f,n}\}$ be a sequence of cluster trees 
computed from finite samples $X_n \subset \mathcal{X}$. Suppose $f \le M$ for some $M \in \mathbb{R}$, and that for any $\lambda$, 
$\{x \in \mathcal{X} : f(x) \ge \lambda\}$ contains finitely many connected components. Then 

1. If $\{\hat{\mathsf{C}}_{f,n}\}$ ensures minimality for $f$, it ensures uniform minimality. 

2. If $\{\hat{\mathsf{C}}_{f,n}\}$ ensures separation for $f$, it ensures uniform separation. 

Proof. We will prove the first case, in which $\hat{\mathsf{C}}_{f,n}$ ensures minimality. The proof of uniform separation 
follows closely, and is therefore omitted. 

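Pick $\delta > 0$. Let $\mathcal{C}_f(\lambda)$ denote the (finite) set of connected components of $\{x \in \mathcal{X} : f(x) \ge \lambda\}$. 
Consider the collection of connected components of superlevel sets spaced $\delta/2$ apart: 

$$\mathcal{D} = \bigcup_{n=0}^{\lfloor 2M/\delta \rfloor} \mathcal{C}_f(n\delta/2)$$

The fact that $\hat{\mathsf{C}}_{f,n}$ ensures minimality implies that for each $C \in \mathcal{D}$ there exists an $N(C)$ such 
that for all $n \ge N(C)$, $C \cap X_n$ is connected at level $h(C) - \delta/2$. Let $N = \max_{C \in \mathcal{D}} N(C)$. This 
is well-defined, as $\mathcal{D}$ is a finite set. 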

Let $A$ be a connected component of $\{x \in \mathcal{X} : f(x) \ge \lambda\}$ for an arbitrary $\lambda$. Let $\lambda' = \lfloor 2\lambda/\delta \rfloor \cdot \delta/2$; 
i.e., $\lambda'$ is the largest multiple of $\delta/2$ such that $\lambda' \le \lambda$. Then $A$ is a subset of some connected 
component $A'$ of $\{x \in \mathcal{X} : f(x) \ge \lambda'\}$. Note that $A' \in \mathcal{D}$, so that $A' \cap X_n$ is connected at level 
$\lambda' - \delta/2$. Therefore $A \cap X_n$ is connected at level $\lambda' - \delta/2 > (\lambda - \delta/2) - \delta/2 = \lambda - \delta$. Since 
$A$ was arbitrary, and the choice of $N$ depended only upon $\delta$, it follows that $\hat{\mathsf{C}}_{f,n}$ ensures uniform 
minimality. □ 

A.2 Proof of Theorem 5 

Proof. Set $\delta = \|f - \tilde{f}\|_\infty$. Let $x, x'$ be two arbitrary points from $\mathcal{X}$. We need to show that 
$|d_{\mathsf{C}_f}(x, x') - d_{\mathsf{C}_{\tilde{f}}}(x, x')| \le 4\delta$, which will then imply the theorem. In what follows, we prove that 
$d_{\mathsf{C}_{\tilde{f}}}(x, x') \le d_{\mathsf{C}_f}(x, x') + 4\delta$. 

Let $m = m_{\mathsf{C}_f}(x, x')$ denote the merge height of $x$ and $x'$ w.r.t. $\mathsf{C}_f$. This means that there exists 
a connected component $C$ of $\{y \in \mathcal{X} \mid f(y) \ge m\}$ such that $x, x' \in C$. Since $\|f - \tilde{f}\|_\infty = \delta$, we 
have that for any point $y \in C$, $|f(y) - \tilde{f}(y)| \le \delta$ and thus $\tilde{f}(y) \ge m - \delta$. Hence all points in $C$ 
must belong to the same connected component, call it $\tilde{C}$ ($\supseteq C$), of $\{y \in \mathcal{X} \mid \tilde{f}(y) \ge m - \delta\}$ with 
respect to the clustering $\mathcal{C}_{\tilde{f}}$. It then follows that the merge height $m_{\mathsf{C}_{\tilde{f}}}(x, x') \ge m - \delta$. Combining 
this with the fact that $\|f - \tilde{f}\|_\infty = \delta$, we have: 

$$d_{\mathsf{C}_{\tilde{f}}}(x, x') = \tilde{f}(x) + \tilde{f}(x') - 2 m_{\mathsf{C}_{\tilde{f}}}(x, x')$$

$$\le f(x) + \delta + f(x') + \delta - 2m + 2\delta = d_{\mathsf{C}_f}(x, x') + 4\delta.$$

The proof for $d_{\mathsf{C}_f}(x, x') \le d_{\mathsf{C}_{\tilde{f}}}(x, x') + 4\delta$ is symmetric. The theorem then follows. □ 

A.3 Proof of Theorem 6 

Proof. Set $\delta := \|f_1 - f_2\|_\infty$. Let $x, x'$ be two arbitrary points from $X$. We need to show that 
$|d_{\mathsf{C}_1}(x, x') - d_{\mathsf{C}_2}(x, x')| \le 4\delta$, which will then imply the theorem. In what follows, we prove that 
$d_{\mathsf{C}_2}(x, x') \le d_{\mathsf{C}_1}(x, x') + 4\delta$. 

Let $m_1 = m_{\mathsf{C}_1}(x, x')$ denote the merge height of $x$ and $x'$ w.r.t. $\mathsf{C}_1$. This means that there exists 
a cluster $C \in \mathcal{C}$ such that $x, x' \in C$ and $f_1(C) = m_1$. Since $f_i(C) = \min_{u \in C} f_i(u)$, for $i = 1, 2$, 
we thus have that $f_2(C) \in [m_1 - \delta, m_1 + \delta]$. It then follows that $m_{\mathsf{C}_2}(x, x') \ge f_2(C) \ge m_1 - \delta$. 
Combining this with the fact that $\|f_1 - f_2\|_\infty = \delta$, we have: 

$$d_{\mathsf{C}_2}(x, x') = f_2(x) + f_2(x') - 2 m_{\mathsf{C}_2}(x, x')$$

$$\le f_1(x) + \delta + f_1(x') + \delta - 2 m_1 + 2\delta = d_{\mathsf{C}_1}(x, x') + 4\delta.$$

The proof for $d_{\mathsf{C}_1}(x, x') \le d_{\mathsf{C}_2}(x, x') + 4\delta$ is symmetric. The theorem then follows. □ 


B Robust single linkage 

We briefly describe the robust single linkage algorithm, and refer readers to the work of Chaudhuri 
and Dasgupta (2010) and Chaudhuri et al. (2014) for details. In what follows, let $B(x, r)$ denote the 
closed ball of radius $r$ around $x$. 

The algorithm operates as follows: Given a sample $X_n$ of $n$ points drawn from a density $f$ 
supported on $\mathcal{X}$, and parameters $\alpha$ and $k$, perform the following steps: 

1. For each $x_i \in X_n$, set $r_k(x_i) = \min\{r : B(x_i, r) \text{ contains } k \text{ points}\}$. 

2. As $r$ grows from $0$ to $\infty$: 

(a) Construct a graph $G_r$ with nodes $\{x_i : r_k(x_i) \le r\}$. Include edge $(x_i, x_j)$ if $\|x_i - x_j\| \le 
\alpha r$. 

(b) Let $\hat{\mathcal{C}}_n(r)$ be the connected components of $G_r$. 

The algorithm produces a series of graphs as $r$ ranges from $0$ to $\infty$. Each connected component 
in Gr for any r is considered a cluster. The clusters exhibit hierarchical structure, and can be 
interpreted as a cluster tree. We may therefore discuss the sense in which this discrete tree converges 
to the ideal density cluster tree. 
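
The steps above can be sketched in Python for a single value of $r$. This is a brute-force illustration under stated assumptions, not the implementation of Chaudhuri and Dasgupta: the function names `knn_radii` and `rsl_clusters` are hypothetical, a point counts itself among the $k$ points in its ball, non-strict inequalities are used for both node and edge inclusion, and distances are Euclidean via `math.dist` (Python 3.8 or later).

```python
import math


def knn_radii(points, k):
    """r_k(x_i): smallest r such that the closed ball B(x_i, r) holds k sample points.

    Assumption: x_i itself counts as one of the k points.
    """
    return [sorted(math.dist(p, q) for q in points)[k - 1] for p in points]


def rsl_clusters(points, k, alpha, r):
    """Connected components (as sorted lists of indices) of the graph G_r."""
    radii = knn_radii(points, k)
    nodes = [i for i, rk in enumerate(radii) if rk <= r]
    parent = {i: i for i in nodes}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Join two nodes whenever they are within distance alpha * r.
    for a in nodes:
        for b in nodes:
            if a < b and math.dist(points[a], points[b]) <= alpha * r:
                parent[find(a)] = find(b)

    groups = {}
    for i in nodes:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Two well-separated groups of three points on a line.
sample = [(0,), (1,), (2,), (10,), (11,), (12,)]
small = rsl_clusters(sample, k=2, alpha=1.5, r=0.5)
medium = rsl_clusters(sample, k=2, alpha=1.5, r=1.2)
large = rsl_clusters(sample, k=2, alpha=1.5, r=6.0)
```

Running this for increasing $r$ shows the hierarchy: no point is admitted at $r = 0.5$, the two groups appear as separate clusters at $r = 1.2$, and they merge into one cluster at $r = 6$.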


C Split-tree based hierarchical clustering 


In this section we inspect a different approach, based on the one proposed and studied by Chazal 
et al. (2013), to obtain a hierarchical clustering for points sampled from a density function supported 
on a Riemannian manifold, using tools from the emerging field of computational topology. 

In particular, we focus on the following setting: Let $\mathcal{M} \subset \mathbb{R}^d$ be a smooth $m$-dimensional 
Riemannian manifold embedded in the ambient space $\mathbb{R}^d$, and $f : \mathcal{M} \to \mathbb{R}$ a $c$-Lipschitz 
probability density function supported on $\mathcal{M}$. Let $P_n$ denote a set of $n$ points sampled i.i.d. according 
to $f$. We further assume that we have a density estimator $f_n : P_n \to \mathbb{R}$ which estimates the true 
density $f$ with the guarantee that $\|f - f_n\|_\infty \le \mathcal{E}(n)$ for an error function $\mathcal{E}(n)$ which tends to zero 
as $n \to +\infty$. 


C.1 Split-cluster tree construction 

We now describe an algorithm which takes as input $P_n$ and the empirical density function $f_n : 
P_n \to \mathbb{R}$, and outputs a hierarchical clustering tree $T_n^r$ on $P_n$. The algorithm uses a parameter 
$r > 0$, which intuitively should go to zero as $n$ tends to infinity. 

Let $K_n = (P_n, E)$ denote the 1-dimensional simplicial complex, where $E := \{(p, p') \mid \|p - 
p'\| \le r\}$. In other words, $K_n$ is the proximity graph on $P_n$ where every point in $P_n$ is connected to 
all other points from $P_n$ within distance $r$ of it. We now define the following hierarchical clustering 
(cluster tree) $T_n^r$: 

Given any value $\lambda$, let $P_n^\lambda := \{p \in P_n \mid f_n(p) \ge \lambda\}$ be the set of vertices with estimated density 
at least $\lambda$, and let $K_n^\lambda$ be the subgraph of $K_n$ induced by $P_n^\lambda$. The subgraph $K_n^\lambda$ may have multiple 
connected components, and the vertex set of each connected component gives rise to a cluster. The 
collection of such clusters for all $\lambda \in \mathbb{R}$ is $T_n^r$, which we call the split-cluster tree of $P_n$ w.r.t. $f_n$. 
We put the parameter $r$ in $T_n^r$ to emphasize the dependency of this cluster tree on $r$. 

In particular, note that the function $f_n : P_n \to \mathbb{R}$ induces a piecewise-linear (PL) function 
on the underlying space $|K_n|$ of $K_n$, which we denote as $\tilde{f} : |K_n| \to \mathbb{R}$. It turns out that the 
tree representation of this cluster tree is exactly the so-called split tree of this PL function $\tilde{f}$ as 
studied in the literature of computational geometry and topology, as a variant of the contour tree; 
see, e.g., (Carr et al., 2003; Wang et al., 2014). This is why we refer to $T_n^r$ as the split-cluster tree of 
$P_n$. The split-cluster tree can be easily computed in $O(n\alpha(n))$ time using the union-find data 
structure, once the vertices in $P_n$ are sorted (Carr et al., 2003). 
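
The sweep from high to low density can be sketched with a union-find structure as follows. This is a minimal illustration rather than the construction of Carr et al. (2003): the function name `split_cluster_tree` is hypothetical, density values are assumed to be distinct, two points are taken to be adjacent when their Euclidean distance is at most $r$, and neighbors are found by brute force, so the sketch does not attain the running time quoted above.

```python
import math


def split_cluster_tree(points, density, r):
    """Return (level, cluster) pairs of the split-cluster tree, from high level to low.

    Assumptions: density values are distinct, and two points are adjacent in the
    proximity graph when their Euclidean distance is at most r.
    """
    order = sorted(range(len(points)), key=lambda i: density[i], reverse=True)
    parent, members, tree = {}, {}, []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in order:  # sweep the level from high to low
        parent[i], members[i] = i, {i}
        for j in list(parent):  # points inserted so far (density above this level)
            if j != i and math.dist(points[i], points[j]) <= r:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri
                    members[ri] |= members.pop(rj)
        # The component containing the new point is the only cluster that changed.
        tree.append((density[i], frozenset(members[find(i)])))
    return tree


# Five points on a line with two density peaks (at indices 0 and 3).
pts = [(0,), (1,), (2,), (3,), (4,)]
dens = [5.0, 4.0, 1.0, 3.0, 2.0]
tree = split_cluster_tree(pts, dens, r=1.5)
```

In this example the two peaks grow as separate clusters until the low-density point at index 2 is inserted at level 1, at which point they merge into a single root cluster.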

We note that Chazal et al. (2013) proposed a clustering algorithm based on this idea, and 
provided theoretical studies of the flat clusterings resulting from such a construction. We 
instead focus on the hierarchical clustering tree constructed using this split tree idea. 

Finally, recall that $f : \mathcal{M} \to \mathbb{R}$ is the true density function. Given $T_n^r$, let $\hat{\mathsf{C}}_{f,n} = (P_n, T_n^r, f)$ be 
the corresponding cluster tree equipped with height function $f : P_n \to \mathbb{R}$ (which is the restriction 
of $f$ to $P_n$). As before, we still use $\mathcal{C}_f$ to denote the high-density cluster tree w.r.t. the true density 
function $f$, and let $\mathsf{C}_f = (\mathcal{M}, \mathcal{C}_f, f)$ be the corresponding cluster tree equipped with height function 
$f$. 

In what follows, we will study the convergence of the distance $d_\gamma(\mathsf{C}_f, \hat{\mathsf{C}}_{f,n})$, where $\gamma \subseteq \mathcal{M} \times P_n$ 
is the natural correspondence induced by identity in $P_n$, that is, $\gamma = \{(p, p) \mid p \in P_n$ (thus 
$p \in \mathcal{M})\}$. For simplicity of presentation, we will omit the reference to this natural correspondence 
$\gamma$ in the remainder of this section. 


C.2 Convergence of split-cluster tree 


First, we introduce some notation. Let $d(x, y)$ denote the Euclidean distance between any two points 
$x, y \in \mathbb{R}^d$, while $d_{\mathcal{M}}(x, y)$ denotes the geodesic distance between points $x, y \in \mathcal{M}$ on the manifold 
$\mathcal{M}$. Given a smooth manifold $\mathcal{M}$ embedded in $\mathbb{R}^d$, the medial axis of $\mathcal{M}$, denoted by $A_{\mathcal{M}}$, is the 
set of points in $\mathbb{R}^d$ which have more than one nearest neighbor in $\mathcal{M}$. The reach of $\mathcal{M}$, denoted 
by $\rho(\mathcal{M})$, is the infimum of the closest distance from any point in $\mathcal{M}$ to the medial axis, that is, 
$\rho(\mathcal{M}) = \inf_{x \in \mathcal{M}} d(x, A_{\mathcal{M}})$. 

Following the notation of Chazal et al. (2013), we further define: 


Definition 13 ((Geodesic) $\epsilon$-sample). Given a subset $Y \subseteq \mathcal{M}$ and a parameter $\epsilon > 0$, a set of 
points $Q \subseteq Y$ is a (geodesic) $\epsilon$-sample of $Y$ if every point of $Y$ is within $\epsilon$ geodesic distance to 
some point in $Q$; that is, $\forall x \in Y$, $\min_{q \in Q} d_{\mathcal{M}}(x, q) \le \epsilon$. 
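
As a small illustration of Definition 13, the Python sketch below checks the $\epsilon$-sample condition on the unit circle, where the geodesic distance is arc length. Everything here is an assumption made for illustration: the function names are hypothetical, the set $Y$ is replaced by a finite discretization (the definition itself quantifies over all of $Y$), and the caller supplies the geodesic distance as a function.

```python
import math


def is_eps_sample(Y, Q, eps, dist):
    """True if every point of Y is within distance eps of some point of Q."""
    return all(min(dist(y, q) for q in Q) <= eps for y in Y)


def circle_geodesic(a, b):
    """Arc-length distance between two angles (in radians) on the unit circle."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


# Y: the circle discretized at one-degree steps; Q: four equally spaced points of Y.
Y = [2 * math.pi * k / 360 for k in range(360)]
Q = [Y[0], Y[90], Y[180], Y[270]]

covers_at_08 = is_eps_sample(Y, Q, 0.8, circle_geodesic)
covers_at_07 = is_eps_sample(Y, Q, 0.7, circle_geodesic)
```

The farthest points of `Y` from `Q` sit halfway between neighboring sample points, at arc length $\pi/4 \approx 0.785$, so `Q` is a 0.8-sample of this discretized circle but not a 0.7-sample.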

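In what follows, let $\mathcal{M}^\lambda = \{x \in \mathcal{M} \mid f(x) \ge \lambda\}$ be the super-level set of $f : \mathcal{M} \to \mathbb{R}$ w.r.t. $\lambda$. 

Lemma 2. We are given an $m$-dimensional smooth manifold $\mathcal{M} \subset \mathbb{R}^d$ with a $c$-Lipschitz density 
function $f : \mathcal{M} \to \mathbb{R}$ on $\mathcal{M}$. Let $\rho(\mathcal{M})$ be the reach of $\mathcal{M}$. Let $P_n$ be an $\epsilon$-sample of $\mathcal{M}^\lambda$. Assume 
that $\|f - f_n\|_\infty \le \eta$ for $f_n : P_n \to \mathbb{R}$ and that the parameter $r$ we use to construct $T_n^r$ satisfies 
$r \ge 2\epsilon$ and $r < \rho(\mathcal{M})/2$. Then $d(\mathsf{C}_f, \hat{\mathsf{C}}_{f,n}) \le \max\{cr + 2\eta, \lambda\}$. 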


Proof. Consider any two points $p, p' \in P_n$. Let $m$ and $\hat{m}$ denote the merge height of $p$ and $p'$ in $\mathsf{C}_f$ 
and in $\hat{\mathsf{C}}_{f,n}$, respectively. By definition, we have that 

$$d(\mathsf{C}_f, \hat{\mathsf{C}}_{f,n}) = \max_{p, p' \in P_n} |m_{\mathsf{C}_f}(p, p') - m_{\hat{\mathsf{C}}_{f,n}}(p, p')|. \qquad (2)$$

We now distinguish two cases: 

Case 1: $m \ge \lambda$. 

In this case, by definition of the merge height $m$ of $p$ and $p'$, we know that: (1) $f(p), f(p') \ge m$, 
and thus both $p$ and $p'$ are from $P_n \cap \mathcal{M}^\lambda$, and (2) $p$ and $p'$ are connected in $\mathcal{M}^m$, thus there is a 
path $\Gamma \subset \mathcal{M}^m$ connecting $p$ and $p'$ such that for any $x \in \Gamma$, $f(x) \ge m$. We now show that the merge 
height of $p$ and $p'$ in $\hat{\mathsf{C}}_{f,n}$ satisfies $\hat{m} \in [m - cr - 2\eta, m + cr + 2\eta]$. 

Indeed, let $\pi : \mathcal{M}^\lambda \to P_n$ be the projection map that sends any $x \in \mathcal{M}^\lambda$ to its nearest neighbor 
in $P_n$. Since $P_n$ is an $\epsilon$-sample for $\mathcal{M}^\lambda$, we have that $d_{\mathcal{M}}(x, \pi(x)) \le \epsilon$ and thus $|f(x) - f(\pi(x))| \le 
c\epsilon$ (as $f$ is $c$-Lipschitz), for any $x \in \mathcal{M}^\lambda$. Consider any two sufficiently close points $x, x' \in \Gamma$ (i.e., 
$\|x - x'\| \le r - 2\epsilon$); we have that (1) either $\pi(x) = \pi(x')$, (2) or $\pi(x) \ne \pi(x')$ but 

$$\|\pi(x) - \pi(x')\| \le \|\pi(x) - x\| + \|x - x'\| + \|x' - \pi(x')\| \le r.$$

In other words, $\pi(\Gamma)$ consists of a sequence of vertices $p = q_1, q_2, \ldots, q_s = p'$ in $K_n$ such that 
there is an edge in $K_n$ connecting any two consecutive $q_i, q_{i+1}$, $i \in [1, s - 1]$. The concatenation of 
these edges forms a path $\Gamma' = (p = q_1, q_2, \ldots, q_s = p')$ in $K_n$ connecting $p$ to $p'$. Since each 
$q_i$ satisfies $q_i = \pi(x)$ for some $x \in \Gamma$, we have that 

$$f(q_i) = f(\pi(x)) \ge f(x) - c\epsilon \ge m - c\epsilon,$$

where the two inequalities follow from $|f(x) - f(\pi(x))| \le c\epsilon$ and $f(x) \ge m$ for any $x \in \Gamma$. 
Since $\|f - f_n\|_\infty \le \eta$, it then follows that $f_n(q_i) \ge m - c\epsilon - \eta$ for any $q_i \in \Gamma'$. 

Recall that $K_n^\alpha$ denotes the subgraph of $K_n$ induced by the set of points $P_n^\alpha = \{p \in P_n \mid 
f_n(p) \ge \alpha\}$ whose function value w.r.t. the empirical density function $f_n$ is at least $\alpha$. It then 
follows that $p$ and $p'$ are connected in $K_n^\alpha$ for $\alpha = m - c\epsilon - \eta$, and hence that the merge 
height satisfies 

$$\hat{m} = m_{\hat{\mathsf{C}}_{f,n}}(p, p') \ge \min_{q \in P_n^\alpha} f(q) \ge \min_{q \in P_n^\alpha} f_n(q) - \eta \ge \alpha - \eta = m - c\epsilon - 2\eta.$$


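We now show the other direction, namely $m \ge \hat{m} - cr - 2\eta$. Indeed, by definition of $\hat{m}$, there is 
a path $\hat{\Gamma} = (q_1 = p, q_2, \ldots, q_t = p')$ in $K_n$ connecting $p$ and $p'$ such that for any $q_i \in \hat{\Gamma}$, $f(q_i) \ge \hat{m}$. 
Now let $\xi_{\mathcal{M}}(q, q')$ denote a minimizing geodesic between two points $q, q' \in \mathcal{M}$. We then have that 
there is a path 

$$\hat{\Gamma}' := \xi_{\mathcal{M}}(q_1, q_2) \circ \xi_{\mathcal{M}}(q_2, q_3) \circ \cdots \circ \xi_{\mathcal{M}}(q_{t-1}, q_t)$$

in $\mathcal{M}$ connecting $p = q_1$ to $p' = q_t$. 

At the same time, note that for any two consecutive nodes $q_i$ and $q_{i+1}$ from $\hat{\Gamma}$, we know 
$\|q_i - q_{i+1}\| \le r$ as $(q_i, q_{i+1})$ is an edge in $K_n$. For $r < \rho(\mathcal{M})/2$, where $\rho(\mathcal{M})$ is the reach of the manifold 
$\mathcal{M}$, by Proposition 1.2 of Dey et al. (2011), we have that the geodesic distance $d_{\mathcal{M}}(q_i, q_{i+1})$ is at 
most $\frac{4}{3}\|q_i - q_{i+1}\|$. Thus $d_{\mathcal{M}}(q_i, q_{i+1}) \le \frac{4}{3} r$. In particular, any point $x \in \xi_{\mathcal{M}}(q_i, q_{i+1})$ is 
within geodesic distance $\frac{2}{3} r$ of either $q_i$ or $q_{i+1}$. Hence we have that 

$$f(x) \ge \min\{f(q_i), f(q_{i+1})\} - \tfrac{2}{3} cr \ge \hat{m} - cr \;\Longrightarrow\; m = m_{\mathsf{C}_f}(p, p') \ge \min_{x \in \hat{\Gamma}'} f(x) \ge \hat{m} - cr.$$

Putting everything together, we have that $|m - \hat{m}| \le cr + 2\eta$ for the case $m \ge \lambda$. 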


Case 2: $m < \lambda$. 

First, note that the proof of $m \ge \hat{m} - cr$ holds regardless of the value of $m$. Hence we have 
$\hat{m} - m \le cr$ for the case $m < \lambda$ as well. On the other hand, since $m < \lambda$, $m - \hat{m} < \lambda$. Thus 
$|m - \hat{m}| \le \max\{cr, \lambda\}$. 

The lemma follows from combining these two cases with Eqn. (2). □ 


Remark: The bound in the above result can be large if the value $\lambda$ is large. We can obtain a 
stronger result for points in $P_n \cap \mathcal{M}^\lambda$ which is independent of $\lambda$. However, the above result is 
cleaner to present and it suffices to prove our main convergence result in Theorem 11. 

To obtain a convergence result, we need to use the following results from Chazal et al. (2013). 

Definition 14 (Chazal et al. (2013)). Let $\mathcal{M}$ be an $m$-dimensional Riemannian manifold with 
intrinsic metric $d_{\mathcal{M}}$. Given a subset $A \subseteq \mathcal{M}$ and a parameter $r > 0$, define $V_r(A)$ to be the infimum 
of the Hausdorff measures achieved by geodesic balls of radius $r$ centered in $A$; that is: 

$$V_r(A) = \inf_{x \in A} \mathcal{H}^m(B_{\mathcal{M}}(x, r)), \text{ where } B_{\mathcal{M}}(x, r) := \{y \in \mathcal{M} \mid d_{\mathcal{M}}(x, y) < r\}. \qquad (3)$$

We also define the $r$-covering number of $A$, denoted by $\mathcal{N}_r(A)$, to be the minimum number of closed 
geodesic balls of radius $r$ needed to cover $A$ (the balls do not have to be centered in $A$). 


Theorem 10 (Theorem 7.2 of Chazal et al. (2013)). Let $\mathcal{M}$ be an $m$-dimensional Riemannian 
manifold and $f : \mathcal{M} \to \mathbb{R}$ a $c$-Lipschitz probability density function. Consider a set $P$ sampled 
according to $f$ in i.i.d. fashion. Then, for any parameter $\epsilon > 0$ and $\alpha > c\epsilon$, we are guaranteed that 
$P$ forms an $\epsilon$-sample of $\mathcal{M}^\alpha$ with probability at least $1 - \ldots$ 


Remarks. For simplicity, we now focus on the case where $\mathcal{M}$ is a compact smooth embedded 
manifold with bounded absolute sectional curvature and positive strong convexity radius $\rho_c(\mathcal{M})$. 
It follows from the Günther-Bishop Theorem (see, e.g., Appendix B of Buchet et al. (2014)) that 
in this case, there exists a constant $\mu$ depending only on the intrinsic properties of $\mathcal{M}$ such that 
$V_r(\mathcal{M}^\alpha) \ge V_r(\mathcal{M}) \ge \mu r^m$ for sufficiently small $r$. Due to the compactness of $\mathcal{M}$, this further 
gives an upper bound on $\mathcal{N}_r(\mathcal{M})$ (and thus on $\mathcal{N}_r(\mathcal{M}^\alpha) \le \mathcal{N}_r(\mathcal{M})$). Thus for fixed $\epsilon$ and $\alpha$, $P_n$ 
forms an $\epsilon$-sample for $\mathcal{M}^\alpha$ with probability 1 as $n \to +\infty$. 

We remark that Lemma 7.3 of Chazal et al. (2013) also states that $\mathcal{N}_{\epsilon/2}(\mathcal{M}^\alpha) < +\infty$ (i.e., it is 
finite) and $V_{\epsilon/2}(\mathcal{M}^\alpha) > 0$ for the more general case where $\mathcal{M}$ is a complete Riemannian manifold 
with bounded absolute sectional curvature, for any $\epsilon < 2\rho_c(\mathcal{M})$. Hence again for fixed $\epsilon$ and $\alpha$, $P_n$ 
forms an $\epsilon$-sample for $\mathcal{M}^\alpha$ with probability 1 as $n \to +\infty$. 

Putting Lemma 2 and Theorem 10 together, we obtain the following: 


Theorem 11. Let $\mathcal{M}$ be a compact $m$-dimensional Riemannian manifold embedded in $\mathbb{R}^d$ with 
positive strong convexity radius. Let $f : \mathcal{M} \to \mathbb{R}$ be a $c$-Lipschitz probability density function 
supported on $\mathcal{M}$. Let $P_n$ be a set of $n$ points sampled i.i.d. according to $f$. Assume that we are 
given a density estimator such that $\|f - f_n\|_\infty$ converges to $0$ as $n \to \infty$. For any fixed $\epsilon > 0$, 
we have, with probability 1 as $n \to \infty$, that $d(\mathsf{C}_f, \hat{\mathsf{C}}_{f,n}) \le (4c + 1)\epsilon$, where the parameter $r$ in 
computing the split-cluster tree $T_n^r$ is set to be $2\epsilon$. 

Proof. Set $\lambda$ in Lemma 2 to be $2c\epsilon$. We then have that $d(\mathsf{C}_f, \hat{\mathsf{C}}_{f,n}) \le 2c\epsilon + 2c\epsilon + 2\|f - f_n\|_\infty$ if 
$P_n$ is an $\epsilon$-sample of $\mathcal{M}^{2c\epsilon}$. Since $\|f - f_n\|_\infty$ converges to $0$ as $n$ tends to $\infty$, there exists $N_\epsilon$ such 
that $\|f - f_n\|_\infty \le \epsilon$ for any $n \ge N_\epsilon$. Hence $d(\mathsf{C}_f, \hat{\mathsf{C}}_{f,n}) \le (4c + 1)\epsilon$ if $P_n$ is an $\epsilon$-sample of $\mathcal{M}^{2c\epsilon}$ 
and $n \ge N_\epsilon$. The theorem follows from this and Theorem 10 above. □ 